causal inference

Installs: 131
Rank: #6593

Install

npx skills add https://github.com/aj-geddes/useful-ai-prompts --skill 'Causal Inference'
Causal Inference
Overview
Causal inference determines cause-and-effect relationships and estimates treatment effects, going beyond correlation to understand what causes what.
When to Use
Evaluating the impact of policy interventions or business decisions
Estimating treatment effects when randomized experiments aren't feasible
Controlling for confounding variables in observational data
Determining if a marketing campaign or product change caused an outcome
Analyzing heterogeneous treatment effects across different user segments
Making causal claims from non-experimental data using propensity scores or instrumental variables
Key Concepts
Treatment: Intervention or exposure
Outcome: Result or consequence
Confounding: Variables affecting both treatment and outcome
Causal Graph: Visual representation of relationships
Treatment Effect: Impact of intervention
Selection Bias: Non-random treatment assignment
Causal Methods
Randomized Controlled Trials (RCT): Gold standard
Propensity Score Matching: Balance treatment/control
Difference-in-Differences: Before/after comparison
Instrumental Variables: Handle endogeneity
Causal Forests: Heterogeneous treatment effects
Implementation with Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from scipy import stats

# Generate observational data with confounding
np.random.seed(42)
n = 1000

# Confounder: Age (affects both treatment and outcome)
age = np.random.uniform(25, 75, n)

# Treatment: Training program (more likely for younger people)
treatment_prob = 0.3 + 0.3 * (75 - age) / 50  # Inverse relationship with age
treatment = (np.random.uniform(0, 1, n) < treatment_prob).astype(int)

# Outcome: Salary (affected by both treatment and age)
# True causal effect of treatment: +$5000
salary = 40000 + 500 * age + 5000 * treatment + np.random.normal(0, 10000, n)

df = pd.DataFrame({
    'age': age,
    'treatment': treatment,
    'salary': salary,
})

print("Observational Data Summary:")
print(df.describe())
print(f"\nTreatment Rate: {df['treatment'].mean():.1%}")
print(f"Average Salary (Control): ${df[df['treatment'] == 0]['salary'].mean():.0f}")
print(f"Average Salary (Treatment): ${df[df['treatment'] == 1]['salary'].mean():.0f}")

# 1. Naive Comparison (BIASED - ignores confounding)
naive_effect = (
    df[df['treatment'] == 1]['salary'].mean()
    - df[df['treatment'] == 0]['salary'].mean()
)
print(f"\n1. Naive Comparison: ${naive_effect:.0f} (BIASED)")

# 2. Regression Adjustment (Covariate Adjustment)
X = df[['treatment', 'age']]
y = df['salary']
model = LinearRegression()
model.fit(X, y)
regression_effect = model.coef_[0]  # coefficient on 'treatment', holding age fixed
print(f"\n2. Regression Adjustment: ${regression_effect:.0f}")

# 3. Propensity Score Matching
# Estimate probability of treatment given covariates
ps_model = LogisticRegression()
ps_model.fit(df[['age']], df['treatment'])
df['propensity_score'] = ps_model.predict_proba(df[['age']])[:, 1]

print("\n3. Propensity Score Matching:")
print(f"PS range: [{df['propensity_score'].min():.3f}, {df['propensity_score'].max():.3f}]")

# Matching: find control for each treated unit
matched_pairs = []
treated_units = df[df['treatment'] == 1].index

for treated_idx in treated_units:
    treated_ps = df.loc[treated_idx, 'propensity_score']

    # Find closest control unit within a caliper of 0.1
    # (controls can be reused, i.e. matching with replacement)
    control_units = df[
        (df['treatment'] == 0)
        & (df['propensity_score'] >= treated_ps - 0.1)
        & (df['propensity_score'] <= treated_ps + 0.1)
    ].index

    if len(control_units) > 0:
        closest_control = min(
            control_units,
            key=lambda x: abs(df.loc[x, 'propensity_score'] - treated_ps),
        )
        matched_pairs.append({
            'treated_idx': treated_idx,
            'control_idx': closest_control,
            'treated_salary': df.loc[treated_idx, 'salary'],
            'control_salary': df.loc[closest_control, 'salary'],
        })

matched_df = pd.DataFrame(matched_pairs)
psm_effect = (matched_df['treated_salary'] - matched_df['control_salary']).mean()
print(f"PSM Effect: ${psm_effect:.0f}")
print(f"Matched pairs: {len(matched_df)}")

# 4. Stratification by Propensity Score
df['ps_stratum'] = pd.qcut(df['propensity_score'], q=5, labels=False, duplicates='drop')

stratified_effects = []
for stratum in df['ps_stratum'].unique():
    stratum_data = df[df['ps_stratum'] == stratum]

    # Only use strata that contain both treated and control units
    if (stratum_data['treatment'] == 0).sum() > 0 and (stratum_data['treatment'] == 1).sum() > 0:
        treated_mean = stratum_data[stratum_data['treatment'] == 1]['salary'].mean()
        control_mean = stratum_data[stratum_data['treatment'] == 0]['salary'].mean()
        effect = treated_mean - control_mean
        stratified_effects.append(effect)

# Unweighted average of the within-stratum differences
stratified_effect = np.mean(stratified_effects)
print(f"\n4. Stratification by PS: ${stratified_effect:.0f}")

# 5. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Treatment distribution by age
ax = axes[0, 0]
treated = df[df['treatment'] == 1]
control = df[df['treatment'] == 0]
ax.hist(control['age'], bins=20, alpha=0.6, label='Control', color='blue')
ax.hist(treated['age'], bins=20, alpha=0.6, label='Treated', color='red')
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')
ax.set_title('Age Distribution by Treatment')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Salary vs Age (colored by treatment)
ax = axes[0, 1]
ax.scatter(control['age'], control['salary'], alpha=0.5, label='Control', s=30)
ax.scatter(treated['age'], treated['salary'], alpha=0.5, label='Treated', s=30, color='red')
ax.set_xlabel('Age')
ax.set_ylabel('Salary')
ax.set_title('Salary vs Age by Treatment')
ax.legend()
ax.grid(True, alpha=0.3)

# Propensity Score Distribution
ax = axes[1, 0]
ax.hist(control['propensity_score'], bins=20, alpha=0.6, label='Control', color='blue')
ax.hist(treated['propensity_score'], bins=20, alpha=0.6, label='Treated', color='red')
ax.set_xlabel('Propensity Score')
ax.set_ylabel('Frequency')
ax.set_title('Propensity Score Distribution')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Treatment Effect Comparison
ax = axes[1, 1]
methods = ['Naive', 'Regression', 'PSM', 'Stratified']
effects = [naive_effect, regression_effect, psm_effect, stratified_effect]
true_effect = 5000
ax.bar(methods, effects, color=['red', 'orange', 'yellow', 'lightgreen'],
       alpha=0.7, edgecolor='black')
ax.axhline(y=true_effect, color='green', linestyle='--', linewidth=2,
           label=f'True Effect (${true_effect:.0f})')
ax.set_ylabel('Treatment Effect ($)')
ax.set_title('Treatment Effect Estimates by Method')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
for i, effect in enumerate(effects):
    ax.text(i, effect + 200, f'${effect:.0f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# 6. Doubly Robust Estimation
from sklearn.ensemble import RandomForestRegressor

# Propensity score model
ps_model_dr = LogisticRegression().fit(df[['age']], df['treatment'])
ps_scores = ps_model_dr.predict_proba(df[['age']])[:, 1]
ps_scores = np.clip(ps_scores, 0.01, 0.99)  # keep inverse weights bounded

# Outcome model
outcome_model = RandomForestRegressor(n_estimators=50, random_state=42)
outcome_model.fit(df[['treatment', 'age']], df['salary'])

# Outcome predictions for every unit under treatment and under control
pred_treated = outcome_model.predict(df[['treatment', 'age']].assign(treatment=1))
pred_control = outcome_model.predict(df[['treatment', 'age']].assign(treatment=0))

# Doubly robust (augmented IPW) estimator: the outcome-model contrast plus
# propensity-weighted corrections based on each unit's observed residual
t = df['treatment'].to_numpy()
y_obs = df['salary'].to_numpy()
dr_effect = np.mean(
    pred_treated - pred_control
    + t * (y_obs - pred_treated) / ps_scores
    - (1 - t) * (y_obs - pred_control) / (1 - ps_scores)
)
print(f"\n6. Doubly Robust Estimation: ${dr_effect:.0f}")

# 7. Heterogeneous Treatment Effects
print("\n7. Heterogeneous Treatment Effects (by Age Quartile):")

for age_q in pd.qcut(df['age'], q=4, duplicates='drop').unique():
    mask = (df['age'] >= age_q.left) & (df['age'] < age_q.right)
    stratum_data = df[mask]

    if (stratum_data['treatment'] == 0).sum() > 0 and (stratum_data['treatment'] == 1).sum() > 0:
        treated_mean = stratum_data[stratum_data['treatment'] == 1]['salary'].mean()
        control_mean = stratum_data[stratum_data['treatment'] == 0]['salary'].mean()
        effect = treated_mean - control_mean
        print(f"  Age {age_q.left:.0f}-{age_q.right:.0f}: ${effect:.0f}")

# 8. Sensitivity Analysis
print("\n8. Sensitivity Analysis (Hidden Confounder Impact):")

# Vary hidden confounder correlation with outcome.
# The factor 0.1 is a fixed illustrative value: it scales how much of the
# hidden confounder's outcome effect is assumed to leak into the estimate.
for hidden_effect in [1000, 2000, 5000, 10000]:
    adjusted_effect = regression_effect - hidden_effect * 0.1
    print(f"  If hidden confounder worth ${hidden_effect}: Effect = ${adjusted_effect:.0f}")

# 9. Summary Table
print("\n" + "=" * 60)
print("CAUSAL INFERENCE SUMMARY")
print("=" * 60)
print(f"True Treatment Effect: ${true_effect:,.0f}")
print("\nEstimates:")
print(f"  Naive (BIASED): ${naive_effect:,.0f}")
print(f"  Regression Adjustment: ${regression_effect:,.0f}")
print(f"  Propensity Score Matching: ${psm_effect:,.0f}")
print(f"  Stratification: ${stratified_effect:,.0f}")
print(f"  Doubly Robust: ${dr_effect:,.0f}")
print("=" * 60)

# 10. Causal Graph (Text representation)
print("\n10. Causal Graph (DAG):")
print("""
    Age → Treatment ← (Selection Bias)
     ↓        ↓
     └────→ Salary

Interpretation:
- Age is a confounder
- Treatment causally affects Salary
- Age directly affects Salary
- Age affects probability of Treatment
""")
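Difference-in-differences appears in the methods list but is not covered by the script above. The following is a minimal sketch on separately simulated data, not part of the original walkthrough. The column names (`treated`, `post`, `salary`), the helper `did_estimate`, and all simulation parameters are hypothetical, and the data are generated so that the parallel-trends assumption holds by construction.

```python
import numpy as np
import pandas as pd


def did_estimate(panel: pd.DataFrame) -> float:
    """Difference-in-differences from the four group-by-period cell means."""
    means = panel.groupby(['treated', 'post'])['salary'].mean()
    treated_change = means.loc[(1, 1)] - means.loc[(1, 0)]
    control_change = means.loc[(0, 1)] - means.loc[(0, 0)]
    return treated_change - control_change


# Hypothetical simulation: two groups observed before and after a program.
rng = np.random.default_rng(0)
n = 2000
treated = rng.integers(0, 2, n)  # group membership, fixed over time

frames = []
for post in (0, 1):
    salary = (
        50000
        + 8000 * treated          # fixed level gap between the groups
        + 3000 * post             # common time trend (parallel trends)
        + 5000 * treated * post   # assumed true effect of the program
        + rng.normal(0, 5000, n)
    )
    frames.append(pd.DataFrame({'treated': treated, 'post': post, 'salary': salary}))

panel = pd.concat(frames, ignore_index=True)
did_effect = did_estimate(panel)
print(f"DiD estimate: ${did_effect:,.0f}")
```

Subtracting the control group's change removes the common time trend, and differencing within each group removes the fixed level gap, so the estimate should land near the simulated effect of $5,000 only because both groups share the same trend here.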
Causal Assumptions
Unconfoundedness: No unmeasured confounders
Overlap: Common support on propensity scores
SUTVA: No interference between units
Consistency: Single version of treatment
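Of these assumptions, overlap can be checked directly from the data, and covariate balance is the usual companion diagnostic. The sketch below is one possible way to do both, on data simulated like the walkthrough's. The helper `smd`, the variable names, and the use of inverse-propensity weights for the "after" comparison are illustrative choices, not something the source prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data mirroring the walkthrough: younger people are treated more often.
rng = np.random.default_rng(42)
n = 1000
age = rng.uniform(25, 75, n)
treatment = (rng.uniform(0, 1, n) < 0.3 + 0.3 * (75 - age) / 50).astype(int)

# Estimated propensity scores P(treatment = 1 | age)
features = age.reshape(-1, 1)
ps = LogisticRegression().fit(features, treatment).predict_proba(features)[:, 1]

# Overlap: the range of propensity scores shared by both groups
overlap_low = max(ps[treatment == 1].min(), ps[treatment == 0].min())
overlap_high = min(ps[treatment == 1].max(), ps[treatment == 0].max())
print(f"Common support: [{overlap_low:.3f}, {overlap_high:.3f}]")


def smd(x, t, w=None):
    """Standardized mean difference of covariate x between treated and control."""
    w = np.ones_like(x, dtype=float) if w is None else w
    mean_treated = np.average(x[t == 1], weights=w[t == 1])
    mean_control = np.average(x[t == 0], weights=w[t == 0])
    pooled_sd = np.sqrt((x[t == 1].var(ddof=1) + x[t == 0].var(ddof=1)) / 2)
    return (mean_treated - mean_control) / pooled_sd


# Balance before and after inverse-propensity weighting
ipw = np.where(treatment == 1, 1 / ps, 1 / (1 - ps))
smd_raw = smd(age, treatment)
smd_weighted = smd(age, treatment, ipw)
print(f"SMD for age, unweighted: {smd_raw:.3f}")
print(f"SMD for age, IP-weighted: {smd_weighted:.3f}")
```

A common rule of thumb treats an absolute standardized mean difference below about 0.1 as adequate balance. Units whose propensity scores fall outside the common-support range have no comparable counterparts and are candidates for trimming.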
Treatment Effect Types
ATE: Average Treatment Effect (overall)
ATT: Average Treatment Effect on the Treated
CATE: Conditional Average Treatment Effect
HTE: Heterogeneous Treatment Effects
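These estimands can be written compactly in potential-outcomes notation. The symbols are assumptions introduced here, not the source's: $Y(1)$ and $Y(0)$ are a unit's outcomes with and without treatment, $T$ is the treatment indicator, and $X$ the observed covariates.

```latex
\begin{align}
\text{ATE} &= \mathbb{E}\big[\,Y(1) - Y(0)\,\big] \\
\text{ATT} &= \mathbb{E}\big[\,Y(1) - Y(0) \mid T = 1\,\big] \\
\text{CATE:}\quad \tau(x) &= \mathbb{E}\big[\,Y(1) - Y(0) \mid X = x\,\big]
\end{align}

% Under unconfoundedness and overlap, the ATE is identified from observed data:
\begin{equation}
\text{ATE} = \mathbb{E}_X\big[\,\mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X]\,\big]
\end{equation}
```

Heterogeneous treatment effects are present whenever $\tau(x)$ varies with $x$. If it is constant, the ATE, ATT, and CATE all coincide.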
Method Strengths
RCT: Gold standard; randomization balances measured and unmeasured confounders in expectation
Matching: Balances groups, preserves overlap
Regression: Adjusts for covariates
Instrumental Variables: Handles endogeneity
Causal Forests: Learns heterogeneous effects
Deliverables
Causal graph visualization
Treatment effect estimates
Sensitivity analysis
Heterogeneous treatment effects
Covariate balance assessment
Propensity score diagnostics
Final causal inference report
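Instrumental variables are listed among the methods as a way to handle endogeneity, but the walkthrough does not implement one. The following is a rough sketch of two-stage least squares with a single instrument, using only NumPy. The simulated variables (`u`, `z`, `d`, `y_iv`), their coefficients, and the helper `ols_slope` are all hypothetical, and the simulation assumes a valid instrument: `z` shifts the treatment but affects the outcome only through it.

```python
import numpy as np

# Hypothetical simulation with an unobserved confounder u.
rng = np.random.default_rng(1)
n = 5000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument, independent of u
d = 0.8 * z + u + rng.normal(size=n)         # treatment intensity
y_iv = 2.0 * d + 3.0 * u + rng.normal(size=n)  # assumed true effect of d is 2.0


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


# Naive regression is biased because u drives both d and the outcome
ols_effect = ols_slope(d, y_iv)

# Stage 1: predict the treatment from the instrument
slope, intercept = np.polyfit(z, d, 1)
d_hat = slope * z + intercept

# Stage 2: regress the outcome on the predicted treatment
iv_effect = ols_slope(d_hat, y_iv)

print(f"OLS estimate: {ols_effect:.2f}")
print(f"2SLS estimate: {iv_effect:.2f}")
```

Because `d_hat` varies only through the instrument, the second stage uses just the part of the treatment that is unrelated to the hidden confounder. The standard errors from this manual second stage would not be valid, so a dedicated IV routine is preferable for inference.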